PyData Amsterdam 2017

From Fourier to Deep Convnets

A.k.a. Signal Processing for Data Science


Ivo Everts, Marcel Raas, Gabriele Modena


https://github.com/godatadriven/pydata-2017-dsp-tutorial

Introduction (45 min.)

  • Signal processing applications
  • Convolution
  • Fourier analysis

Feature engineering (50 min.)

  • 1D signals: filter banks, pooling
  • 2D signals: bag-of-words

Break (15 min.)

Feature learning (30 min.)

  • PCA / LDA
  • Convolutional neural network

Speech processing hackathon (40 min.)

  • Speaker / gender / age / nationality recognition...
  • ...based on Fourier / bag-of-words / convnet

About us

  • Ivo Everts, Data Scientist at GoDataDriven, BSc+MSc+PhD in AI (biased to Computer Vision).

  • Marcel Raas: Data Scientist at GoDataDriven, BSc+MSc+PhD in physics.

  • Gabriele Modena: former colleague who deserves kudos here, and not only because he is on the conference committee.

Info

Copyright © 2017 GoDataDriven. All rights reserved. Unless otherwise indicated, all materials on these pages are copyrighted by GoDataDriven. All rights reserved. No part of these pages, either text or image may be used for any purpose other than personal use. Therefore, reproduction, modification, storage in a retrieval system or retransmission, in any form or by any means, electronic, mechanical or otherwise, for reasons other than personal use, is strictly prohibited without prior written permission.

(but please use the code if you can;)

In [156]:
import util, os, multiprocessing as mp
util.small_canvas_size(); util.figsize();
# viz
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import gridspec, mlab
%pylab inline
splot = subplot
# num
import pandas as pd
from scipy import signal, stats, ndimage
from sklearn import *
Populating the interactive namespace from numpy and matplotlib

Introduction

  • 'Traditional data' comes from the (relational) database in rows and columns
  • Data is usually stored in a dataframe for processing, in Python, R, Spark, ...
  • Columns typically represent clearly defined variables
  • Apart from data cleaning and basic preprocessing, the data can be used as such by a machine learning algorithm, e.g. to predict income

    id  age  gender  city  income
    01   35       m   ams    NULL
    02   20       m   rot   40000
    03   20       f   rot   35000
    04   40       f   ams   50000
    ..  ...  ......  ....  ......

Nontraditional data

  • Non- (or at least less-) traditional data is e.g. text, speech, music, video, images, financial series, seismic waves, wifi/bluetooth counts, ...
  • In a machine learning context, all those data have in common that they are unstructured and thus some sort of feature extraction needs to precede further processing

Applications

Digital signal processing

A signal can broadly be viewed as a function

$$f(\mathbf{t}) \rightarrow \mathbf{v},$$

$$\mathbf{t} \in \mathbb{R}^N, \quad \mathbf{v} \in \mathbb{R}^M.$$

Thus, mathematical operators can be applied to signals - this is already quite different from traditional data!

Usually, t denotes a step in time (i.e. N=1) and v is some scalar measurement at time t (i.e. M=1). These are time series such as simple sensor measurements (temperature, humidity, seismic waves, people counts), financial transactions, etc. For images, t denotes the image coordinates (row, column) and v is the (R,G,B) pixel value.

In [3]:
# some quasi-periodic function definition
def fn(t):
    return sin(t) + cos(t*2) + 2*random.rand(len(t))
In [160]:
plot_f();

In order to represent this data for usage by a machine learning algorithm, we need to perform some sort of preprocessing and feature extraction.

In [161]:
# example
plot_inout_example();

There are challenges due to noise and signal length; interesting features lie in the periodicity of the signal; this contrasts with simple global statistics such as max, mean.

Convolution a.k.a. filtering

The preprocessing of and feature extraction from signals typically involves an operation called convolution, commonly denoted as $$(f * g)(t)$$ where f is our original function and g is some other function (the filter, or kernel).

Our data consist of a finite set of measurements at discrete steps, i.e. samples. This allows for a convenient discrete notation of the convolution: $$(f*g)[\tau]=\sum_{t=-T}^T f[\tau-t]g[t].$$
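The discrete sum above can be checked directly with a few lines of numpy (a minimal sketch, not part of the tutorial's util code; indices are shifted so both sequences start at 0, matching numpy's 'full' convolution):

```python
import numpy as np

# Direct implementation of the discrete convolution sum,
# compared against numpy's built-in np.convolve.
def discrete_convolution(f, g):
    N, T = len(f), len(g)
    out = np.zeros(N + T - 1)
    for tau in range(len(out)):
        for t in range(T):
            if 0 <= tau - t < N:
                out[tau] += f[tau - t] * g[t]
    return out

f = np.array([1.0, 2.0, 3.0])
g = np.array([0.5, 0.5])
assert np.allclose(discrete_convolution(f, g), np.convolve(f, g))
```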

Filtering with the Gaussian kernel

Probably the most common operation in signal processing is smoothing - which is achieved by convolving the signal f with a (Gaussian) smoothing filter g. This can be used for noise reduction in e.g. audio signals, and blurring of images.
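For example, smoothing a noisy sine with scipy's Gaussian filter (a small sketch with arbitrary parameters, not the `plot_gaussian_convolution_example` code):

```python
import numpy as np
from scipy import ndimage

rng = np.random.RandomState(0)
t = np.linspace(0, 4 * np.pi, 500)
noisy = np.sin(t) + 0.5 * rng.randn(len(t))

# Convolve with a Gaussian kernel (sigma measured in samples)
smoothed = ndimage.gaussian_filter1d(noisy, sigma=10)

# Smoothing suppresses the noise: the residual w.r.t. the clean
# sine is much smaller after filtering.
err_before = np.mean((noisy - np.sin(t)) ** 2)
err_after = np.mean((smoothed - np.sin(t)) ** 2)
```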

In [162]:
plot_gaussian_convolution_example()
In [163]:
# Basically any numerical package/module/language
# has functionality for performing convolution
# and associated kernel creation.
# It is an elementary operation for vector/matrix manipulation
# and thus particularly relevant to DSP.
plot_gaussians()

Multi-scale processing

The Gaussian kernel is parameterized by its standard deviation, so we can control that.

In [164]:
# go through the code
plot_multiscale_convolution_example()

Kernel density estimation

As an important side note of statistical nature: given the observed density below, how would convolution come in handy?

In [165]:
plot_gaussian_kde_example()

Higher dimensions... images!

In [166]:
# example of smoothing a 2D signal - i.e. image blurring!
image = random.rand(100, 100)
for sigma, i in zip(sigmas, range(len(sigmas))):
    splot(1, len(sigmas), i+1);
    util.imshow(ndimage.filters.gaussian_filter(image, sigma));
    util.xlabel('sigma={:.2f}'.format(sigma))
In [167]:
# couple of options for showing a 2D gaussian filter
splot(131);
image = zeros((11, 11));
image[5, 5] = 1;
util.imshow(util.imscale(
    ndimage.filters.gaussian_filter(image, 3)
))
splot(132);
x, y = meshgrid(range(101), range(101))
z = mlab.bivariate_normal(x, y, mux=50, muy=50, 
                                sigmax=10, sigmay=10)
util.imshow(util.imscale(z))
splot(133, projection='3d').plot_surface(x, y, z);

So, I guess you are starting to make sense of all that convolutional net imagery...!?

Besides smoothing, one of the most common applications of convolution on images is edge detection, which lies at the heart of many visual recognition systems.

In [168]:
city = util.imread('city.jpg');
forest = util.imread('forest.jpg');
splot(121); util.imshow(util.detect_edges(city));
splot(122); util.imshow(util.detect_edges(forest));

Let's start with less complex images

In [169]:
# not all digits are available in our mini sample
digits = [0,1,3,4,5,6,8,9]
for i in range(len(digits)):
    splot(2, 4, i+1);
    util.imshow(get_digit(digits[i]));
In [22]:
# now this should be fully understood (and appreciated;)
hybrid_image_plot(get_digit(5))

The image is just a 2D function $I = f(row, column), I \in [0, 255]$*, so we can apply math to it.

* a pixel is usually stored with 1 byte per channel, and converted to a float in [0,1] for ease of computation.
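In numpy, that conversion is a one-liner (illustrative values):

```python
import numpy as np

# A pixel stored as one byte per channel, rescaled to [0, 1]
img_uint8 = np.array([[0, 128, 255]], dtype=np.uint8)
img_float = img_uint8.astype(np.float64) / 255.0
```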

Now we try to understand how edge detection works

In [24]:
digit = get_digit(8)
show_image_derivatives(digit)

These images are created by convolving with a derivative filter. The image gradient is defined as the vector $[\delta_x, \delta_y]$. Its magnitude $\sqrt{{\delta_x}^2+{\delta_y}^2}$ corresponds to the amount of contrast; its orientation $\arctan(\delta_y / \delta_x)$ corresponds to the edge direction.
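As a minimal sketch (not the tutorial's `show_image_derivatives`), gradient magnitude and orientation can be computed directly with scipy's Sobel filters:

```python
import numpy as np
from scipy import ndimage

# Tiny image with a vertical edge between columns 1 and 2
im = np.array([[1.0, 1.0, 0.0, 0.0]] * 4)

# Derivatives along columns (x) and rows (y) via Sobel filtering
dx = ndimage.sobel(im, axis=1)
dy = ndimage.sobel(im, axis=0)

magnitude = np.sqrt(dx**2 + dy**2)      # amount of contrast
orientation = np.arctan2(dy, dx)        # edge direction

# Contrast is concentrated at the vertical edge; the flat
# regions (and the identical rows) produce zero response.
```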

A derivative filter: Sobel

In [25]:
smoothing  = array([+1, +2, +1]).reshape(3,1)
derivative = array([+1,  0, -1]).reshape(1,3)
sobel_derivative = dot(smoothing, derivative)
splot(131);
util.imshow(util.imscale(sobel_derivative));
util.title('Derivative filter')
splot(132);
im = array([[1, 0.5, 0],[1, 0.5, 0],[1, 0.5, 0]])
util.imshow(im); util.title('Image')
splot(133);
util.imshow(util.imscale(
    ndimage.convolve(im, sobel_derivative
)));
util.title('Image derivative')

Now for dy as well

In [26]:
show_image_derivatives(im)

The Gaussian derivative filter

Convolving a signal with a Gaussian filter smooths the signal. Convolving a signal with a 'difference' filter like Sobel sort of differentiates the signal. Convolving a signal with a Gaussian derivative filter...

In [28]:
show_gaussian_orders()

Now we can extract multiscale differential features from images

In [31]:
show_gaussian_scale_orders()

Back to lower dimensions

We can do the same with 1D signals

In [32]:
# finally, some custom code for a gaussian function ;)
def g(s=1, o=0, n=9):
    x = arange(n)-int(n/2)
    e = exp( - pow(x, 2) / (2*pow(s,2)) )
    p = sqrt(2 * pi)
    def g_0():
        return e / (s * p)
    def g_1():
        return -(e*x) / ( pow(s, 3) * p )
    return {0: g_0, 1: g_1}[o]()
plot(g(s=5,o=0,n=31)); plot(g(s=5,o=1,n=31));
util.legend(['Gaussian', 'Gaussian derivative'])

Sobel vs Gauss

In [33]:
v = rand(10) * 10
# Raw Sobel filtering
plot(ndimage.filters.sobel(v))
# Gaussian derivative
plot(convolve(g(s=1,o=1,n=3), v, 'same'))
plot(v)
util.legend(['Sobel derivative',
             'Gaussian derivative',
             'Raw signal'])
xticks(range(len(v))); # so that we can see clearly...
util.figsize()

Fourier analysis

Let's have a look at the following stock market graph:

The graph looks quite random, but there seems to be a repeating pattern. How could we determine the periodicity?

We could smooth the signal (Gaussian smoothing with parameter $\sigma$) to get rid of most of the noise, followed by making a distribution of the distances of the $k$ nearest neighbours. The period will then hopefully be visible as a peak in this distribution.

Why use the Fourier transform?

The Fourier transform tells you how to break up a signal into periodic components. You get all the frequencies/periods in your signal, and it tells you how much each contributes to the total as well!

Note: Let's not try to fully understand the math, but rather focus on the concepts and Python implementations.

Clearly it has an application in sound data, where the periodicity corresponds to the pitch of a sound, e.g. notes in music and intonation in speech.

This extends to 2D signals, so image data can also be decomposed into frequencies.

This is for instance used in image compression, where the data for some frequencies that are not really important for the picture's perception is thrown away.

In [34]:
im = util.imread('images/jjfourier.jpg', edit_dir=True)
trans = fft.rfft2(im)
window_size = 10
trans[:, window_size:trans.shape[1]-window_size] = 0
util.image_subplots([im, fft.irfft2(-trans)])

How does it work? Approximately ;)

$$\text{sinusoidal wave} = \text{amplitude} \cdot \sin(\text{frequency} \cdot \text{time} + \text{phase})$$

$$f(t) = A \cdot \sin(2\pi{\cdot}f{\cdot}t + \phi)$$
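Sampling such a sinusoid in numpy is straightforward (amplitude, frequency and phase chosen arbitrarily for illustration):

```python
import numpy as np

A, f, phi = 2.0, 5.0, 0.0                       # amplitude, frequency (Hz), phase
t = np.linspace(0, 1, 1000, endpoint=False)     # one second, 1000 samples
wave = A * np.sin(2 * np.pi * f * t + phi)

# 5 Hz over one second gives 5 full cycles, bounded by +/- A
```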

In [36]:
# What happens when you add 2 harmonics
HTML('<iframe src=https://betterexplained.com/examples/fourier/?cycles=0,1,1 width=600 height=280></iframe>')
Out[36]:

Any function can be written as a sum of harmonics

In [38]:
plot(X, y, color="blue", label="y")
plot(X, f, color="red", label="f(t)")
legend(fontsize=15, loc='upper left');

A.k.a. the Fourier series $$f(t) = a_0 + \sum_{n=1}^\infty [a_n \cos(\omega_n t) + b_n \sin(\omega_n t)]$$
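As a concrete example (not shown in the hidden plotting code above): a square wave has only odd sine terms, with $b_n = 4/(\pi n)$, and the approximation improves as harmonics are added:

```python
import numpy as np

t = np.linspace(0, 2 * np.pi, 1000, endpoint=False)
square = np.sign(np.sin(t))

# Partial Fourier series of the square wave: odd sine terms only
def partial_sum(n_terms):
    f = np.zeros_like(t)
    for n in range(1, 2 * n_terms, 2):          # n = 1, 3, 5, ...
        f += (4 / (np.pi * n)) * np.sin(n * t)
    return f

# Mean squared error shrinks as more harmonics are included
err = [np.mean((partial_sum(k) - square) ** 2) for k in (1, 5, 50)]
```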

The Discrete Fourier Transform

$$F_k = \frac{1}{N} \sum_{n=0}^{N-1}{f_n e^\frac{-2 i \pi n k}{N}}$$

where:

  • N is the number of samples in the time domain
  • $F_k$ is the DFT for frequency $k$
  • $k$ is the frequency
  • $f_n$ is the value of the signal at time $n$

And Mr. Euler came up with this: $e^{i \varphi} = \cos \varphi + i \sin \varphi$, which shows the relationship between complex exponentials and harmonic functions.

The discrete Fourier transform of a real sequence of numbers will be a sequence of complex numbers of the same length
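We can check the definition directly against numpy (a minimal sketch; note that `np.fft.fft` omits the 1/N normalization used in the formula above):

```python
import numpy as np

def dft(f):
    # Direct implementation of the definition, including the 1/N factor
    N = len(f)
    n = np.arange(N)
    return np.array([np.sum(f * np.exp(-2j * np.pi * n * k / N)) / N
                     for k in range(N)])

x = np.random.RandomState(42).rand(16)   # a real-valued signal
F = dft(x)

# Complex output of the same length as the input...
assert np.iscomplexobj(F) and len(F) == len(x)
# ...matching numpy's FFT, which does not apply the 1/N normalization
assert np.allclose(F, np.fft.fft(x) / len(x))
```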

Working with FFT in python

In [40]:
frequency = 5.0
phs = np.linspace(0, frequency * 2 * pi, 128, endpoint=False)
X = cos(phs)
fft_bins = rfftfreq(len(X), d=1/float(len(X)))
F = real(fft.rfft(X))
fft_plot((phs, X), (fft_bins, F))
In [41]:
noisy_X = add_gaussian_noise(X)
noisy_F = real(fft.rfft(noisy_X))
fft_plot((phs, noisy_X), (fft_bins, noisy_F))

The Convolution Theorem

Recall: a signal can be smoothed by convolving it with a Gaussian filter.

In [44]:
def f(t):
    return sin(t) + cos(t*2) + 2*random.rand(len(t))

g = gaussian(3, 1)
t = arange(255)
v = f(t)
plot_convolution_result(g, t, v)

We can determine the frequencies in the signal by using the Fourier transform.

In [45]:
freq = rfftfreq(len(t), d=1/float(len(t)))
power_spectrum = absolute(fft.rfft(v))**2
idx = argsort(freq)
plot(freq[idx], power_spectrum[idx]);
title('Power spectrum of rfft(v)')
xlabel('freq'); ylabel('power');

Now let's look at the power spectrum of the convolved signal.

In [47]:
plot(freq[idx], power_spectrum_convolved[idx]);
title('Power spectrum of the convolution v * (g/sum(g))')
xlabel('freq'); ylabel('power');

As expected, the high frequencies are attenuated after smoothing the signal.

Now let's look at the power spectrum of the Gaussian filter

In [49]:
plot(freq[idx], power_spectrum_window[idx]);
util.title('Power spectrum of the Gaussian filter (zero - padded)')
util.xlabel('freq'); util.ylabel('power');

And finally: the product of the spectra of the filter and the original signal.

In [50]:
prod = power_spectrum_window * power_spectrum
plot(freq[idx], prod[idx]);
util.title('Product of the window and signal power spectra ')
util.xlabel('freq'); util.ylabel('power');

What the F!? This looks exactly the same as the spectrum of the filtered signal. Could this be a theorem?

The Convolution Theorem says that given two functions $f$ and $g$:

$$ \mathrm{FT}(f * g) = \mathrm{FT}(f) \cdot \mathrm{FT}(g). $$

Meaning that convolution in the time domain is equivalent to multiplication in the frequency domain.
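This can be verified numerically (a sketch, not the tutorial's code): zero-padding both FFTs to the length of the linear convolution makes the identity exact, since the DFT version of the theorem is stated for circular convolution.

```python
import numpy as np

rng = np.random.RandomState(0)
f = rng.rand(64)
g = rng.rand(16)

# Linear convolution has length len(f) + len(g) - 1; zero-pad the
# FFTs to that length so the circular DFT theorem applies exactly.
L = len(f) + len(g) - 1
lhs = np.fft.fft(np.convolve(f, g))           # FT(f * g)
rhs = np.fft.fft(f, L) * np.fft.fft(g, L)     # FT(f) . FT(g)
assert np.allclose(lhs, rhs)
```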

The FFT is applied on speech data in the last section of this tutorial.

Feature engineering

Why are we not always using the Fourier Transform for representing signals? Come on, think about it: with the FT we can represent any signal precisely!

This is a classic: as opposed to precisely modelling the data, we nowadays collect examples and extract features for representing that data in a statistical formulation of the problem. A.k.a. machine learning.

Invariance

Let's think about factors in the process of signal measurement against which the representation should be invariant.

  • Background noise in audio
    • Cocktail party
  • Illumination conditions in an image
    • Shadow, shading, highlights
  • ...

In feature extraction for machine learning, we aim to find the optimal trade-off between discriminative power and invariance.

  • what is the most discriminative representation of a signal?

  • what is the most invariant representation of a signal?

  • what is other common terminology to denote this phenomenon?

  • what is the most discriminative representation of a signal?
    • the raw signal itself
  • what is the most invariant representation of a signal?
    • a representation that is always the same, independent of the signal
  • what is other common terminology to denote this phenomenon?
    • bias/variance tradeoff

Feature extraction using filter banks

We can represent a signal by its responses to a set of filters, commonly referred to as a filter bank. The filters are designed such that a signal's space/time and frequency characteristics are adequately measured.

A filter response represents the similarity between the signal and the filter, while no explicit assumptions are being made about the underlying data-generating process. This generally results in robust representations.

So let's have a look, and get some data first. We'll stick with 1 dimension for now.

In [52]:
signals, labels = util.make_datamarket_health_dataset()
plot_signals(signals, labels)
print('{} samples'.format(len(signals)))
660 samples

Interesting properties of these data are: min, max, mean, periodicity, ... So we should choose appropriate filters and extract features by convolution.

In [53]:
plot_signals(signals, labels)

Gabor filters

A popular filter for capturing space/time/frequency characteristics is the Gabor filter. These are mostly well-known in image processing but generally apply to signals of any dimension.

The Gabor transform is one of many so-called band-pass filters that allow you to 'cut' the Fourier transform and isolate only specific information.

This is rather complex, but in the following we'll just assume that a Gabor filter is basically a sinusoid multiplied by a Gaussian, with which we can do cool stuff.

In [55]:
# create a gabor filter based on 
#   a gaussian with std s and 
#   a sinusoid with frequency f
def make_gabor_filter(s, f, sample_rate=None):
    sample_rate = s/10 if sample_rate is None else sample_rate
    t = arange(-s, +s, sample_rate)
    A = 1; p = 0;
    sinusoid = A * np.cos(f * t + p)
    gaussian = signal.gaussian(len(t), s**2)
    return sinusoid * gaussian

plot_filter(make_gabor_filter(pi/.5, 1, 0.1));

Now create a filter bank by considering a set of parameters for constructing Gabor filters

In [57]:
# create a filter bank from the given set of parameters
def make_gabor_filter_bank(params, sample_rate=1):
    return [make_gabor_filter(s, f, sample_rate) \
            for (s,f) in params]

# make a sensible set of parameters for filter bank construction
def get_gabor_filter_bank_params(
        # default parameter set
        std = arange(pi/2, pi/.5, pi/2),
        freq = arange(1, 4, 1)):
    # joint parameter set
    x, y = meshgrid(std, freq)
    return array([x.flatten(), y.flatten()]).T
In [58]:
plot_gabor_filter_bank(get_gabor_filter_bank_params())

Signal classification using filter banks

Let's move on to a classification round using filter banks: which signal represents which disease?

In [59]:
plot_signals(signals, labels);

For constructing fixed-length feature vectors, we need to convolve the signal and aggregate the filter response using some aggregator. This is known as feature pooling.

In [60]:
# default feature aggregation, a.k.a. pooling
default_feature_aggregators = [mean, std, max]

# apply a given set of functions to the given data
def apply_all(x, fs):
    return array([f(x) for f in fs])

# apply a filterbank to a signal
# and extract features using the provided aggregators
def apply_filter_bank(v, fb, aggregators):
    fb_result = [apply_all(np.convolve(v, f, 'valid'),\
                           aggregators)\
                 for f in fb]
    return array(fb_result).flatten()

For each of the 9 filters, we pool the result with a mean, std and max aggregator. This yields 27 numbers.

In [61]:
# build the filter bank
gabor_params = get_gabor_filter_bank_params()
fb = make_gabor_filter_bank(gabor_params)

# apply the filter bank to a signal
apply_filter_bank(signals[0], fb, default_feature_aggregators)
Out[61]:
array([ 1495.23176041,   939.92067101,  3367.1978114 ,  -551.52858247,
        1014.9391915 ,  1443.31771911, -1694.35771167,  1292.39585964,
         296.61427513,  -484.51676414,   300.38765524,    50.11674388,
         567.52926912,   430.0839714 ,  1560.15736545,  -547.93679011,
         525.11152748,   336.15900611,  -164.65933555,   108.19364065,
         -22.36140403,  -593.83879869,   312.31241484,  -145.02812401,
         446.11985246,   285.7219919 ,  1035.56172359])

Train and test the features.

In [62]:
# apply the filterbank to all signals
def extract_gabor_features(signals, fb, aggregators):
    return array([apply_filter_bank(signal, fb, aggregators)\
                  for signal in signals])

gabor_features = extract_gabor_features(
                    signals, fb, default_feature_aggregators)
gabor_features.shape
Out[62]:
(660, 27)
In [63]:
# leave-one-out classification round with gabor features
result1 = util.l1o_model_validation(gabor_features, labels)
print('gabor features l1o: {:.0f}%'.format(100*result1))
gabor features l1o: 99%

Note: this is not necessarily the best approach or the best parameter setting for this problem, but we've come a long way in one try ;)

Also note that usually max-pooling results in the most discriminative representation.

In [64]:
# store the 1D gabor features for later usage
save('data/feats1D.npy', 
     hstack([gabor_features, labels.reshape((len(labels),1))]));

2D Gabor filter banks: extracting image texture

In [65]:
# gabor kernels are readily available in scikit-image
from skimage.filters import gabor_kernel
In [67]:
# prepare filter bank kernels
util.figsize(20, 8)
kernels, params = get_gabor_kernels(True)
for i in range(len(kernels)):
    splot(4, 4, i+1); # 16 kernels assumed
    util.imshow(hstack([util.imscale(real(kernels[i])), 
                        util.imscale(imag(kernels[i]))]))
    util.title(params[i])
util.figsize()

Let's extract Gabor features from an image...

In [68]:
wood1 = util.imread('images/vertical_wood.jpg');
wood2 = util.imread('images/horizontal_wood.jpg');
util.image_subplots([wood1, wood2])

...and look how they represent the image content.

In [70]:
wood1_gabor_feats = compute_kernel_feats(mean(wood1, axis=2),
                                         kernels)
wood2_gabor_feats = compute_kernel_feats(mean(wood2, axis=2),
                                         kernels)
plot_feat_hists(wood1_gabor_feats, wood2_gabor_feats)
util.legend(['vertical mean','horizontal mean'])
util.xlabel('Feature index')
util.title('Gabor features on horizontal and vertical textures')

Constructing a visual vocabulary with Gabor features

Here, we'll see how to create a visual vocabulary from local features, which can be used for constructing bag-of-visual-words image representations. This model will then be applied to a bunch of cat and dog images using another popular type of image feature (SIFT), which dominated the computer vision literature between Fourier and deep learning (roughly;).

Extract local features from an image

In [264]:
local_gabor_features = compute_local_kernel_feats(
                            mean(city, axis=2),
                            kernels,
                            descriptors_only=False)
locs = local_gabor_features[:, :3]
feats = preprocessing.StandardScaler().\
            fit_transform(local_gabor_features[:, 3:])
util.imshow(city)
feats.shape
Out[264]:
(16068, 16)

Compute bag-of-visual-words image descriptor

In [265]:
# do clustering of the features to construct 10 visual words
k = 10
gabor_vocab = cluster.KMeans(k).fit(feats)

# map the gabor features to their 'prototypes'
p = gabor_vocab.predict(feats)
image_descriptor = histogram(p, bins=k)[0]

# create the bovw descriptor: the distribution of local features over visual words
image_descriptor / sum(image_descriptor)
Out[265]:
array([ 0.09366443,  0.04288026,  0.24166044,  0.06198656,  0.05072193,
        0.11426438,  0.05146876,  0.21521036,  0.0238362 ,  0.1043067 ])

Visualize a bunch of image patches that have been mapped to the same visual word.

In [275]:
word_nr = 6
l = random.permutation(locs[p==word_nr])
for i in range(len(l)):
    if i < 24:
        splot(3, 8, i+1)
        util.imshow(get_patch(city, l[i], zoom=1))

Now the same vocabulary can be used for representing any other image

In [273]:
bim = make_border_image(200)
bim_gabor_feats = compute_local_kernel_feats(
                        bim, kernels, descriptors_only=False)
bim_locs = bim_gabor_feats[:, :3]
bim_feats = preprocessing.StandardScaler().\
                fit_transform(bim_gabor_feats[:, 3:])
util.imshow(bim)
bim_feats.shape
Out[273]:
(900, 16)

And we can see that similar structures are mapped to the same word

In [276]:
bim_p = gabor_vocab.predict(bim_feats)
bim_l = random.permutation(bim_locs[bim_p==word_nr])
for i in range(len(bim_l)):
    if i < 24:
        splot(3, 8, i+1)
        util.imshow(get_patch(bim, bim_l[i], zoom=1))

And maybe this 3D Gabor application is a nice way to conclude the Gabor stuff.

Visual recognition with bags-of-visual-words

Before the rise of deep learning with convolutional neural networks, the status quo in computer vision was the bag-of-visual-words model. This model involves 1) adequately representing local image features, 2) constructing a visual vocabulary for vector-quantizing the feature vectors, and 3) training a classifier on the image features and associated labels (e.g. dog, cat). In the following we'll touch on each of these aspects.

For this occasion, I have hidden most of the code and I will skip the details because at the end of the tutorial I recommend you pip install Keras anyway ;)

BOVW 1) Representing image features

Probably the most cited article in computer vision is David Lowe's paper on SIFT: the Scale-Invariant Feature Transform. These are local image features that have dominated computer vision, and they are typically better suited to general vision problems than Gabor features.

There's quite a lot to it - here we'll just mention the Feature Transform part: it uses features based on the image gradient, which we saw earlier in the tutorial. But scroll through the paper for the fun of it.

Note the importance of local features for robustness against geometric distortions due to scale, rotation, viewpoint and occlusion.

Let's just extract local image features using Mahotas, which came up as easy and free to use: SIFT is patented, and Mahotas provides a convenient lib for extracting a fast approximation called SURF (elected in 2016 as ECCV's best paper of 10 years ago).

In [77]:
# import surf from Mahotas
from mahotas.features import surf

ucity = util.float3d_to_uint2d(city)
uforest = util.float3d_to_uint2d(forest)

# detect feature locations and extract feature vectors
# let's not worry about all the possible parameter settings
city_features = surf.surf(ucity)
forest_features = surf.surf(uforest)
In [78]:
# Plotting the local features on the image canvas,
# we can see the different scales and orientations.
util.figsize(12, 12)
util.imshow(surf.show_surf(ucity, city_features[:250]))
util.figsize()

BOVW 2) Learning the visual vocabulary

So now that we have a way to extract robust local features, we can proceed by learning the visual vocabulary. For this, we'll use dogs and cats.

In [80]:
print(len(labels), 'files,',
      sum(labels), 'dogs,',
      len(labels)-sum(labels), 'cats')
100 files, 50 dogs, 50 cats
In [81]:
util.image_subplots([util.imread(filepaths[1], edit_dir=False),
                     util.imread(filepaths[-2], edit_dir=False)])
In [82]:
util.image_subplots([util.imread(filepaths[0], edit_dir=False),
                     util.imread(filepaths[-1], edit_dir=False)])

That looks like a pretty hard problem

We'll be extracting dense features, because that usually performs better than sampling at keypoints (more = more). In addition, in this lecture we're using only 100 images, and we need enough data to be able to construct the visual vocabulary.

In [243]:
# construct the vocabulary (may take a while)
num_words = 100
surf_vocab = create_visual_vocabulary(filepaths, num_words)
surf_vocab
Out[243]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=100, n_init=10, n_jobs=-1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

BOVW 3) Construct image descriptors and use your favorite classifier for visual recognition

In [244]:
def extract_surf_descriptor(filepath):
    global surf_vocab
    return extract_image_descriptor(filepath, surf_vocab)

with mp.Pool(6) as descriptor_pool:
    image_descriptors = array(descriptor_pool.map(
                                        extract_surf_descriptor,
                                        filepaths))
image_descriptors.shape
Out[244]:
(100, 100)

Go bananas

In [246]:
# evaluate performance of cat/dog recognition
for cls in [KNeighborsClassifier(1), LogisticRegression(),\
            SVC(kernel='rbf'), SVC(kernel='linear')]:
    res = util.l1o_model_validation(image_descriptors,
                                    array(labels),
                                    classifier=cls)
    print('{}\t{:.0f}%'.format(cls.__str__()[:50], 100*res))
KNeighborsClassifier(algorithm='auto', leaf_size=3	59%
LogisticRegression(C=1.0, class_weight=None, dual=	64%
SVC(C=1.0, cache_size=200, class_weight=None, coef	56%
SVC(C=1.0, cache_size=200, class_weight=None, coef	63%

Ok, well, let's just stick with this. In practice you would also do multi-scale sampling, cross-validate the number of visual words and the model hyper-parameters, consider different implementations for feature extraction, use a richer BOVW model such as VLAD or Fisher vectors with a spatial pyramid, etc.

Feature learning

Because we are mortal

Having executed all those feature engineering tasks, we are left with big matrices of numbers that are difficult to interpret - usually more so than with 'traditional data'. So we should invoke some learning step that identifies the relevant features. Here, we consider Principal Component Analysis and Linear Discriminant Analysis.

Load the features and labels from previous sections

In [249]:
# best cat/dog model in previous run was a linear SVM
bovw_cls = SVC(kernel='linear')
In [250]:
gabor_features, disease = load_numpy_dataset('data/feats1D.npy')
gabor_res0 = util.l1o_model_validation(gabor_features, disease)
bovw_features, pet = load_numpy_dataset('data/feats2D.npy')
bovw_res0 = util.l1o_model_validation(bovw_features, pet,
                                      classifier=bovw_cls)
print('gabor0: {:.2f}\nbovw0: {:.2f}'.\
      format(gabor_res0, bovw_res0))
gabor0: 0.99
bovw0: 0.63

Principal Component Analysis and Linear Discriminant Analysis

With PCA we can obtain an unsupervised dimensionality reduction, while LDA is a supervised technique.
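A minimal side-by-side sketch of the two (using sklearn's iris dataset rather than the tutorial's features, purely for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# PCA: unsupervised, ignores the labels entirely
X_pca = PCA(n_components=2).fit_transform(X)

# LDA: supervised, uses the labels to find discriminative directions
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)
```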

Using increasing numbers of principal components for classifying diseases. With a small number of features we get almost optimal performance!

In [94]:
# this takes a while
plot_progressive_pca_validation(gabor_features,
                                disease, 'disease')

Using increasing numbers of principal components for classifying pets. With a small number of features we get better performance than with all features!!

In [251]:
plot_progressive_pca_validation(bovw_features,
                                pet, 'pet',
                                classifier=bovw_cls)

Linear Discriminant Analysis

In [96]:
# use PCA for selecting 2 features
pca_feats = decomposition.PCA(n_components=2) \
                .fit_transform(gabor_features)

# use LDA for selecting 2 features
lda_feats = discriminant_analysis.LinearDiscriminantAnalysis() \
                .fit_transform(gabor_features, disease)
In [98]:
pca_lda_plot(pca_feats, lda_feats, disease)

Making the features better linearly separable does not help much when using a nearest-neighbour classifier!

Use a linear SVM to boost performance on LDA features (a bit)

In [99]:
print('PCA+SVM: {:.0f}%\nLDA+SVM: {:.0f}%'.format(
    100*util.l1o_model_validation(pca_feats, disease,
                                  svm.SVC(kernel='linear')),
    100*util.l1o_model_validation(lda_feats, disease,
                                  svm.SVC(kernel='linear'))))
PCA+SVM: 83%
LDA+SVM: 93%

Plus, we see that performance on the PCA-ed feats decreases!

Convolutional Neural Networks (CNNs)

We have learned in this tutorial what convolution and pooling are. A CNN adds several more layers for mapping input signals to output classes, providing an end-to-end learning mechanism for both the representation and the classification function.

convnet

The input layer of a CNN consists of filters that are used to extract features from the input signal by convolution. For images, the learned filters resemble the Gabor filters quite a lot.

convlayer

Ok, so now we don't have to put engineering effort into feature extraction, but we do have to think about / experiment with:

  • number of convolution/pooling/connected layers
  • number of filters
  • filter size
  • narrow vs wide convolution
  • stride size
  • channels

The following describes one such combination of choices.
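These choices interact through the spatial size of each layer's output. A small helper (pure Python, a sketch of the standard size arithmetic) makes this explicit for the MNIST setup used below:

```python
def conv_output_length(n, filter_size, stride=1, padding='valid'):
    """Spatial output length of a conv/pool layer along one dimension."""
    if padding == 'valid':          # narrow convolution: no zero padding
        return (n - filter_size) // stride + 1
    return -(-n // stride)          # 'same': wide, zero-padded; ceil(n/stride)

# the MNIST model below: 28x28 input, two 3x3 'valid' convs, 2x2 max pooling
n = conv_output_length(28, 3)            # -> 26
n = conv_output_length(n, 3)             # -> 24
n = conv_output_length(n, 2, stride=2)   # -> 12
# with 32 filters, Flatten() then yields 12 * 12 * 32 = 4608 values
```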

Image classification with a CNN

The MNIST handwritten digits dataset is a common benchmark for image classification and ships with Keras:

In [101]:
from keras.datasets import mnist
(X_train, y_train), (X_test, y_test) = mnist.load_data()
util.image_subplots((squeeze(X_train[0,:,:]),
                     squeeze(X_train[1,:,:])))
Using TensorFlow backend.

The data is further preprocessed (not shown), and with Keras it is easy to build a CNN:

In [103]:
from keras.models import Sequential
model = Sequential()
In [104]:
input_shape = (28, 28, 1)
n_filters = 32
filter_size = 3
pool_size = 2

We can create a filter_size * filter_size convolutional layer as an instance of Convolution2D

In [105]:
from keras.layers.convolutional import Convolution2D

conv_layer = Convolution2D(n_filters, filter_size, filter_size,
                           border_mode='valid',
                           input_shape=input_shape)
model.add(conv_layer)
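Per filter, such a layer computes a 2D 'valid' (narrow) convolution; note that, like most deep learning libraries, it actually computes cross-correlation, i.e. without flipping the kernel. A NumPy sketch of the operation (illustrative, not Keras internals):

```python
import numpy as np

def conv2d_valid(img, kernel):
    # what one filter of a conv layer computes: slide the kernel over the
    # image and take the elementwise product-sum at each position
    h, w = kernel.shape
    H, W = img.shape
    out = np.empty((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i+h, j:j+w] * kernel)
    return out

# a 3x3 all-ones filter on a 28x28 all-ones image: every response is 9
out = conv2d_valid(np.ones((28, 28)), np.ones((3, 3)))
```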

Add an activation layer (deciding how strongly a unit passes its signal on to the next layer) and a pooling layer:

In [106]:
from keras.layers import Activation
from keras.layers.convolutional import MaxPooling2D

model.add(Activation('relu'))
model.add(Convolution2D(n_filters, filter_size, filter_size))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(pool_size, pool_size)))
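A minimal NumPy sketch of what these two layers do to a small activation map (illustrative, not Keras internals):

```python
import numpy as np

def relu(x):
    # rectified linear unit: negative activations are set to zero
    return np.maximum(x, 0.0)

def max_pool_2x2(x):
    # keep the maximum of each non-overlapping 2x2 block
    h, w = x.shape
    return x[:h//2*2, :w//2*2].reshape(h//2, 2, w//2, 2).max(axis=(1, 3))

a = np.array([[ 1., -2.,  3.,  0.],
              [-1.,  5., -3.,  2.],
              [ 0.,  1.,  2., -4.],
              [ 3., -1.,  0.,  6.]])
out = max_pool_2x2(relu(a))  # -> [[5., 3.], [3., 6.]]
```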

And finally map the convolution and pooling results to the output categories (0-9):

In [107]:
from keras.layers import Flatten, Dense

model.add(Flatten())
model.add(Dense(128))
model.add(Activation('relu'))
model.add(Dense(n_classes))
model.add(Activation('softmax'))

Train the model, using batch_size samples per iteration and passing over the whole dataset nb_epoch times.
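For concreteness, the number of gradient updates per epoch is ⌈n_samples / batch_size⌉:

```python
import math

n_samples, batch_size, n_epochs = 60000, 512, 2
updates_per_epoch = math.ceil(n_samples / batch_size)   # 118
total_updates = updates_per_epoch * n_epochs            # 236
```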

In [108]:
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

model.fit(X_train, y_train,
          batch_size=512, nb_epoch=2,
          validation_data=(X_test, y_test));
Train on 60000 samples, validate on 10000 samples
Epoch 1/2
60000/60000 [==============================] - 86s - loss: 0.3595 - acc: 0.8976 - val_loss: 0.1078 - val_acc: 0.9685
Epoch 2/2
60000/60000 [==============================] - 79s - loss: 0.0865 - acc: 0.9749 - val_loss: 0.0623 - val_acc: 0.9805
In [109]:
score = model.evaluate(X_test, y_test, verbose=0)
print('Test accuracy:', score[1])
Test accuracy: 0.9805

Visual recognition with pretrained deep convnets

There are deep models trained on millions of images across thousands of classes, right at your fingertips. We can use them to classify images directly, or to extract features and use those for our own classification problem.

In [110]:
from keras.applications.vgg16 import VGG16
from keras.preprocessing import image
from keras.applications.vgg16 import preprocess_input

Using the dog/cat images still pointed to by filepaths, extract deep features using a pretrained CNN from Keras, with special thanks to Oxford's Visual Geometry Group.

In [112]:
model = VGG16(weights='imagenet', include_top=False)
with mp.Pool(6) as vgg_pool:
    vgg_feats = array([extract_vgg_features(img_pp(filepath),
                                            model)
                       for filepath in filepaths])

Show us the money!

In [113]:
util.l1o_model_validation(vgg_feats, pet)
Out[113]:
0.53000000000000003

...what a disappointment!

Recall that we are using the output of what is usually the second-to-last layer, so we probably need one more transformation step. Let's use PCA.

In [114]:
ncs = arange(5)+1
pca_res = []
for nc in ncs:
    pca = decomposition.PCA(n_components=nc)
    pca_res.append(util.l1o_model_validation(vgg_feats,
                                             pet,
                                             normalizer=pca))

With PCA applied to the deep features, we achieve almost perfect performance!

In [115]:
plot(ncs, pca_res);
xlabel('# principal components'); ylabel('accuracy');
ylim([0.88,1]); xticks(ncs);
title('Cat vs Dog recognition with VGG deep feats and PCA');

(Like, OMG!!##!? WTF????!!&%$! is this real)

Whooohoooooo!

In [117]:
scatter(pca_vgg_feats[labels==0,0],
        pca_vgg_feats[labels==0,1], c='b');
scatter(pca_vgg_feats[labels==1,0],
        pca_vgg_feats[labels==1,1], c='r');
legend(['cats','dogs'], fontsize=15, loc='upper left');

Speech processing hackathon

Only if there is time left, otherwise just try this at home.

In [118]:
# standard audio sample rate (samples per second)
sample_rate = 44100.

From each of a handful of speakers we have two recordings plus associated metadata, so we can build models that infer this metadata from the speech data.

In [119]:
from metadata import *

print(list(people.keys()))
people['Ivo']
['Marcel', 'Ivo2', 'Andrew', 'Nelli', 'Gabriele', 'Ivo', 'Jelte', 'Ron']
Out[119]:
{'age': 35, 'gender': 'male', 'mood': 'neutral', 'nationality': 'Dutch'}
In [120]:
sentences
Out[120]:
{'s1': 'I want to implement a speech recognition system',
 's2': 'the fourier transform is super cool but rather complex'}

I recorded the audio on my Mac with QuickTime, resulting in M4A files. With ffmpeg you can convert between many audio, image and video formats. For this tutorial only the WAV files are shipped.

In [121]:
# point to folder where data is stored
data_folder = 'data/speech_samples/'

def m4a2wav(data_folder):
    m4a_list = [os.path.join(data_folder, file_name)
                for file_name in os.listdir(data_folder)
                if file_name.endswith('m4a')]
    for m4a_file in m4a_list:
        os.system('ffmpeg -y -i {} {}'.\
                  format(m4a_file, m4a_file[:-3]+'wav'))

# convert all m4a files
for person in people.keys():
    m4a2wav(os.path.join(data_folder, person))

IPython provides awesome utilities

In [122]:
from IPython.display import *

# play a wav file
def play_sound(data):
    global sample_rate
    if type(data) == str:
        return Audio(filename=data,
                     autoplay=False, rate=sample_rate)
    else:
        return Audio(data.astype(int16),
                     autoplay=False, rate=sample_rate)
    
play_sound(os.path.join(data_folder, 'Ivo', 's1.wav'))
Out[122]:

Load the speech data

In [123]:
from scipy.io import wavfile

# return the first (mono) channel, normalized so the peak amplitude is 1
def read_wave(filepath):
    sample_rate, audio_channels = wavfile.read(filepath)
    mono = audio_channels[:, 0]
    return mono / float(np.max(np.abs(mono)))
    
s1 = read_wave(os.path.join(data_folder, 'Nelli', 's1.wav'));
s2 = read_wave(os.path.join(data_folder, 'Gabriele', 's1.wav'));

Plot the signals

In [124]:
subplot(121); plot(s1); title('s1 pronounced by Nelli');
xlabel('time'); ylabel('normalized amplitude');
subplot(122); plot(s2); title('s1 pronounced by Gabriele');

Extract features using the Fast Fourier Transform

In [126]:
p1, f1 = fourier_transform(s1); p2, f2 = fourier_transform(s2);
plot(f1, p1, label='Nelli'); plot(f2, p2, label='Gabriele');
pyplot.xlim([0, 1000]); pyplot.legend(fontsize=15);
xlabel('frequency'); ylabel('power');
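The `fourier_transform` helper is defined elsewhere; a plausible minimal version (an assumption on our part, using NumPy's real FFT) returns the power spectrum together with its frequency axis in Hz:

```python
import numpy as np

def fourier_transform(wave, sample_rate=44100.0):
    # power spectrum of a real-valued mono signal, plus frequency axis in Hz
    power = np.abs(np.fft.rfft(wave)) ** 2
    freq = np.fft.rfftfreq(len(wave), d=1.0 / sample_rate)
    return power, freq

# sanity check: a 441 Hz tone sampled for one second peaks at 441 Hz
t = np.arange(44100) / 44100.0
p, f = fourier_transform(np.sin(2 * np.pi * 441 * t))
```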

It looks like we can use the frequency distributions to discriminate between Nelli and Gabriele. Let's chop it up for a compact representation.

In [127]:
# extract the amount of power in a given frequency range
def power_at(power, freq, lbound, hbound):
    lbin = argmin(abs(freq - lbound))
    hbin = argmin(abs(freq - hbound))
    # TODO: correct for non-exact localization
    return sum(power[lbin:hbin])
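A quick, self-contained sanity check of this band-power binning (pure NumPy): nearly all the power of a 50 Hz tone lands in the 0-100 Hz band.

```python
import numpy as np

fs = 1000.0
t = np.arange(0, 1, 1 / fs)
x = np.sin(2 * np.pi * 50 * t)               # pure 50 Hz tone
power = np.abs(np.fft.rfft(x)) ** 2
freq = np.fft.rfftfreq(len(x), d=1 / fs)

def power_at(power, freq, lbound, hbound):
    # sum the power between the bins nearest to lbound and hbound
    lbin = np.argmin(np.abs(freq - lbound))
    hbin = np.argmin(np.abs(freq - hbound))
    return np.sum(power[lbin:hbin])

band_fraction = power_at(power, freq, 0, 100) / power.sum()
```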

Create a dataset

In [128]:
# dataset params
freq_step = 100
freq_max = 1000
lbounds = arange(0, freq_max, freq_step)

# store features and labels
features = []
labels = []

# keep waves in memory for later usage
waves = []
In [129]:
for person in people.keys():
    for s in sentences.keys():
        w = read_wave(os.path.join(data_folder, person, '{}.wav'\
                                   .format(s)));
        p, f = fourier_transform(w)
        features.append(array([power_at(p, f, b, b+freq_step) 
                               for b in lbounds]))
        labels.append([person,
                       people[person]['nationality'],
                       people[person]['gender'],
                       people[person]['mood'],
                       people[person]['age'],
                       s])
        waves.append(w)

Create dataframe (code not shown)

In [131]:
dataset.head(5).T
Out[131]:
0 1 2 3 4
f0 -0.421025 -0.404128 -0.43439 -0.414063 -0.180684
f1 -0.628028 -0.880296 0.704927 0.819942 -0.84289
f2 0.272792 -0.325877 -0.661536 -0.301881 -1.24012
f3 -0.583158 -0.949314 1.89822 -0.490668 -0.330657
f4 -0.932387 -0.898587 2.30568 0.19562 -0.616166
f5 -0.850168 -0.661366 3.16831 0.24307 -0.861628
f6 0.247941 -0.525531 0.648278 -0.244405 -1.36581
f7 0.874248 -0.533401 1.78762 1.67368 -0.9359
f8 2.5395 0.624118 -0.491731 -1.10561 -0.777151
f9 3.39059 -0.301294 -0.0401555 0.528934 -0.627082
name Marcel Marcel Ivo2 Ivo2 Andrew
nationality Dutch Dutch Dutch Dutch Australian
gender male male male male male
mood neutral neutral angry angry neutral
age 33 33 36 36 40
sentence s2 s1 s2 s1 s2
n_name 5 5 3 3 0
n_nationality 1 1 1 1 0
n_gender 1 1 1 1 1
n_mood 1 1 0 0 1
n_age 2 2 4 4 5
n_sentence 1 0 1 0 1

Use LDA for an informative plot

In [134]:
lda_plot()

Bag-of-audio-words using n-jet filters

In [136]:
for i, f_spike in enumerate(f_spikes):
    subplot(2, 2, i+1);
    extend = 150*int(params[i,0])
    center = len(f_spike)//2  # integer division: float indices break in Python 3
    plot(f_spike[center-extend:center+extend]);
    xticks([]); yticks([]); title(params[i,:])

Look at an example of local features over time

In [138]:
feats = extract_local_features(waves[0], aggs=[max, mean])
imshow(feats.T); colorbar();
In [140]:
# code not shown
bow_feats.shape
Out[140]:
(16, 10)
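The bag-of-words step hidden behind "code not shown" boils down to assigning each local feature to its nearest codebook word and histogramming the assignments. A hedged NumPy sketch, with a random codebook standing in for the learned one:

```python
import numpy as np

rng = np.random.RandomState(0)
codebook = rng.randn(10, 4)          # 10 'audio words', 4-dim (illustrative)
local_feats = rng.randn(200, 4)      # local features from one recording

# nearest codebook word per local feature (Euclidean distance)
dists = np.linalg.norm(local_feats[:, None, :] - codebook[None, :, :], axis=2)
words = dists.argmin(axis=1)

# normalized histogram over words = the bag-of-words feature vector
bow = np.bincount(words, minlength=len(codebook)) / float(len(local_feats))
```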

Use the LDA plot to illustrate the bow features

In [141]:
lda_plot(bow_feats)

Do a round of classification on fft and bow features, with the raw/pca/lda version.

In [143]:
fft_bow_cls_wrapper()
*** name ***
fft: 0.38/0.31/0.38, bow: 0.12/0.12/0.12

*** nationality ***
fft: 0.81/0.81/0.62, bow: 0.44/0.50/0.31

*** gender ***
fft: 0.94/0.94/0.88, bow: 0.75/0.75/0.62

*** mood ***
fft: 0.75/0.69/0.69, bow: 0.62/0.69/0.62

*** age ***
fft: 0.44/0.38/0.44, bow: 0.12/0.12/0.19

*** sentence ***
fft: 0.56/0.56/0.56, bow: 0.69/0.88/0.62

On these data, FFT-based features seem to perform well

Convolutional neural networks

A CNN can also be trained on 1D signals such as speech. Here it does not make much sense due to the small amount of data, but with larger datasets this approach is worth trying.

In the following, we'll build a 1D CNN for discriminating between the 2 sentences.

After preprocessing the data (not shown) we can train the model

In [154]:
# fit the model using train- and validation- data
model.fit(pwaves[:8], keras_labels[:8],
          batch_size=4,
          nb_epoch=5,
          validation_data=(pwaves[8:12], keras_labels[8:12]));
Train on 8 samples, validate on 4 samples
Epoch 1/5
8/8 [==============================] - 1s - loss: 1.1921e-07 - acc: 1.0000 - val_loss: 4.9573 - val_acc: 0.5000
Epoch 2/5
8/8 [==============================] - 1s - loss: 1.1921e-07 - acc: 1.0000 - val_loss: 4.9898 - val_acc: 0.5000
Epoch 3/5
8/8 [==============================] - 1s - loss: 1.1921e-07 - acc: 1.0000 - val_loss: 5.0168 - val_acc: 0.5000
Epoch 4/5
8/8 [==============================] - 1s - loss: 1.1921e-07 - acc: 1.0000 - val_loss: 5.0393 - val_acc: 0.5000
Epoch 5/5
8/8 [==============================] - 1s - loss: 1.1921e-07 - acc: 1.0000 - val_loss: 5.0579 - val_acc: 0.5000

Let's see whether the model does any good on held-out data after all.

In [155]:
res = model.evaluate(pwaves[12:], keras_labels[12:], verbose=0)
print('Accuracy = {}'.format(res[1]))
Accuracy = 0.5

Not really, but that's ok

We can visualize the filters that have been learned. We set the number of filters to 2 and the filter size to 100 samples.

In [147]:
filters = model.layers[0].get_weights()
for i, f in enumerate(filters[0].reshape(filter_size, num_filters).T):
    plot(f, label='filter {}'.format(i+1))
legend();

And we can see that a 100-sample snippet of a wave sort of resembles the filters:

In [148]:
plot(pwaves[0][900:1000]);
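This resemblance is the whole point: a learned filter acts as a matched template, and convolution responds most strongly where the signal looks like the filter. A toy NumPy illustration:

```python
import numpy as np

t = np.linspace(0, 1, 1000)
pattern = np.sin(2 * np.pi * 40 * t[:100])   # a 100-sample 'filter'
sig = np.zeros(1000)
sig[500:600] = pattern                       # embed the pattern at index 500

# cross-correlation (convolution with the flipped filter);
# the response peaks exactly where the pattern was embedded
response = np.correlate(sig, pattern, mode='valid')
```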

Recap

  • Convolution
  • Fourier analysis

Feature engineering

  • 1D signals: filter banks, pooling
  • 2D signals: bag-of-words

Feature learning

  • PCA / LDA
  • Convolutional neural network: cat/dog to perfection

Speech processing hackathon

  • Speaker / gender / age / nationality recognition...
  • ...based on Fourier / bag-of-words / convnet

Thank you and have a great time at PyData.

ivoeverts@godatadriven.com